Transcriptomics with RNA-seq

Jelmer Poelstra

MCIC Wooster, OSU

2024-02-01

Intro to transcriptomics & RNA-seq

Recap: central dogma & omics


Alternative splicing & isoforms

https://en.wikipedia.org/wiki/Protein_isoform

The transcriptome

The transcriptome is the full set of transcripts expressed by an organism, which:

  • Is not at all stable across time & space in any given organism,
    unlike the genome but much like the proteome

  • Varies both qualitatively (which transcripts are expressed) but especially quantitatively (how much of each transcript is expressed)

Transcriptomics

Transcriptomics is the study of the transcriptome, i.e. the large-scale study of RNA transcripts expressed in an organism.


There are many approaches & applications — but most commonly, transcriptomics focuses on:

  • mRNA rather than on noncoding RNA types such as rRNA, tRNA, and miRNA

  • Quantifying gene expression levels (cf. most genomics approaches)

  • Statistically comparing expression levels between groups (treatments, biotypes, tissues, etc.)


https://hbctraining.github.io

Transcriptomics

Transcriptomics is the study of the transcriptome, i.e. the large-scale study of RNA transcripts expressed in an organism.


There are many approaches & applications — but most commonly, transcriptomics focuses on:

  • mRNA rather than on noncoding RNA types such as rRNA, tRNA, and miRNA

  • Quantifying gene expression levels (cf. most genomics approaches)

  • Statistically comparing expression levels between groups (treatments, biotypes, tissues, etc.)


Why do transcriptomics?

Considering…

  • That protein production gives clues about the activity of specific biological functions, and the molecular mechanisms underlying those functions;

  • That it is much easier to measure protein expression than transcript expression at large scales;

  • The central dogma

… we can use gene expression levels as a proxy for protein expression levels and make functional inferences.


Why do transcriptomics? (cont.)

Specifically, we can use transcriptomics to:


  • Compare & contrast phenotypic vs. molecular responses/differences

  • Find the pathways and genes that:

    • Underlie phenotypic responses

    • Explain differences between groups (treatments, genotypes, sexes, tissues, etc.)

    • Can be targeted to enhance or reduce responses for pathogen and pest control

What is RNA-seq?

RNA-seq is the current state-of-the-art family of methods to study the transcriptome.

It involves the random (“shotgun”) sequencing of millions of transcript fragments per sample.


We will focus on the most common type of RNA-seq, which:

  • Does not actually sequence the RNA, but first reverse transcribes RNA to cDNA

  • Attempts to sequence only mRNA while avoiding noncoding RNAs (“mRNA-seq”)

  • Does not distinguish between RNA from different cell types (“bulk RNA-seq”)

  • Uses short reads (up to 150 bp) that do not cover full transcripts but do uniquely identify genes


Which type of RNA-seq?

When people say “RNA-seq” without further specifications, you can assume they’re talking about this type.

Other RNA-seq applications

RNA-seq data can also be used for applications other than expression quantification, such as to:

  • Identify & analyze SNPs (for population genetics, molecular evolution, functional associations, etc)
  • For organisms without a reference genome: characterize the transcriptome, i.e. identify genes present in the organism

  • For organisms with a reference genome: discover new genes & transcripts, and improve genome annotation


All in all, RNA-seq is a very widely used technique —
it constitutes the most common usage of high-throughput sequencing!

RNA-seq project examples

RNA-seq is also the most common data type I assist with as a bioinformatician at the MCIC. Some projects I’ve been involved in used RNA-seq to identify genes & pathways that differ between:

  • Multiple soybean cultivars in response to Phytophtora sojae inoculation; soybean in response to different Phytophtora species and strains (Dorrance lab, PlantPath)

  • Wheat exposed to Xanthomonas with a gene knock-out vs. knock-in (Jacobs lab, PlantPath)


  • Mated and unmated mosquitos (Sirot lab, College of Wooster)

  • Tissues of the ambrosia beetle and its symbiotic fungus (Ranger lab, USDA Wooster)

  • Diapause-inducing conditions for two pest stink bug species (Michel lab, Entomology)


  • Human carcinoma cell lines with vs. without expression manipulation of a gene (Cruz lab, CCC)

  • Pig coronaviruses with vs. without an experimental insertion (Wang lab, CFAH)


As well as to improve the genome annotation of a plant-parasitic nematode (Taylor lab, PlantPath).

RNA-seq project examples (cont.)

[TODO: 1 or 2 figs, e.g. from Tracy]

Other ways of quantifying gene expression

RNA-seq is not the only method to quantify gene expression, but often the preferred one:


  • Reverse-transcription qPCR (RT-qPCR)
    A low-throughput method that can quantify expression levels for one or a few genes at a time

  • Microarrays — hybridization of cDNA to preconstructed probes
    Microarrays were the transcriptomics method of choice before the advent of high-throughput sequencing, but disadvantages relative to RNA-seq include:

    • Produce relative measurements only with a much smaller dynamic range than RNA-seq

    • Can only capture known features: cannot discover or examine novel features, and do not work for species without a (close) reference genome

Some other types of RNA-seq

  • Small RNA-seq for small noncoding RNAs such as microRNAs and Piwi-interacting RNAs

  • Single-cell RNA-seq (scRNA-seq) — commonly used in cancer research

  • Direct RNA sequencing


Our focus

The rest of this lecture and tomorrow’s lab will only focus on bulk RNA-seq of mRNA-derived cDNA.

Experimental design

Experimental design: groups & replicates

RNA-seq typically compares groups of samples defined by differences in:

  • Treatments (e.g. different host plant, temperature, diet, mated/unmated) and/or

  • Organismal variants: ages/developmental stages, sexes, or genotypes (lines/biotypes/subspecies/morphs) and/or

  • Tissues

Experimental design: groups & replicates

https://github.com/ScienceParkStudyGroup/rnaseq-lesson

Experimental design: groups & replicates

To be able to make statistically supported conclusions about expression differences between such groups of samples, we must have biological replication.

When designing an RNA-seq experiment, keep the following in mind:

  • Numbers of replicates
    These are typically quite low: 3 replicates per treatment (x tissue x biotype, etc.) is the most common. Not advisable to go lower — if possible, use 4 or 5 replicates.
  • Technical replicates
    You won’t need technical replicates that only replicate library prep and/or sequencing, but depending on your experimental design, may want to tecnically replicate something else.
  • Statistical comparison design
    Preferably, keep your design relatively simple with 1-2 independent variables and 2-3 levels for each of them. Specifically, pairwise comparisons are easiest to interpret.

Experimental design side note

Dual RNA-seq

In agricultural research, it is common to study both a plant host and its adversary.

If the adversary is a microbial/viral/fungal/oomycete pathogen, it is possible or necessary to collect samples that contain both host and pathogen.


When these samples are subjected to RNA-seq, they will contain both host and pathogen reads.

These reads can be separated bioinformatically and when both host and pathogen reads are analyzed, this is called a dual RNA-seq experiment.

From samples to reads

From samples to reads: overview of steps

https://sydney-informatics-hub.github.io/training-RNAseq-slides

Overview of library prep steps


  • Many samples can be “multiplexed” into a single RNA-seq library












Fig. from Kukurba & Montgomery 2015 (www.ncbi.nlm.nih.gov/pmc/articles/PMC4863231/)

RNA extraction

The first step is to extract RNA from tissue samples. This is typically the only step you conduct yourself in the lab, with sequencing facilities taking care of downstream procedures.


Some considerations and recommendations:

  • The tissue sample should be collected from an organism that is alive or recently deceased (matter of minutes), and the RNA should be extracted immediately.

  • With the appropriate buffer and temperate, an RNA extraction can be stored long-term.

  • Include a DNase treatment step to avoid DNA contamination.

  • Perform Quality Control (QC) on your extractions: quantity (e.g. with a Nanodrop or Qubit) and quality (degradation; e.g. with a gel). Sequencing facilities can additionally check the RIN score with a “Bioanalyzer”.

RIN (RNA Integrity Number) score

https://www.novogene.com/eu-en/resources/blog/rna-integrity-number-rin-explained

  • The RIN score quantifies RNA quality: the lower the RIN score, the more degraded your RNA is likely to be.


  • It makes use the expectation of pronounced peaks corresponding to highly abundance, intact 18S and 28S rRNAs.

Library prep options

As discussed last week, “library preparation” is the series of lab steps to produce, from (DNA or) RNA extractions, a collection of molecules ready to be sequenced.


Let’s consider the different options you have for RNA-seq library prep:

  • mRNA enrichment
    Needed because mRNA makes up only a few percent of RNA molecules, and done using either:

    • PolyA-selection: specifically selects mature (spliced) mRNAs

    • Ribodepletion: fishes out rRNA — this is suboptimal for mRNA-seq (presence of other RNAs and of pre-mRNAs) but can be a last resort with poor RNA quality


  • Library strandedness
    Libraries can be “stranded” or “unstranded” — stranded libraries allow you to tell the directionality of a transcript, which in turn allows you to:

    • Distinguish between overlapping genes

    • Assess whether you may have DNA contamination

Sequencing

  • Sequencing technology

    • Illumina short reads are by far the most common

    • PacBio or ONT long reads are an alternative if sequencing full transcripts (isoforms) is key


  • Single-end vs. paired-end reads (for Illumina)

    • Paired-end has limited added value for reference-based, gene-level workflows (but can be key in other scenarios) — but it is still common as prices are often similar

  • Sequencing “depth” — how many reads per sample

    • Guidelines highly approximate (cf. in genomics) — depends not just on transcriptome size, but also on expression level distribution, expression levels of genes of interest, etc.

    • Typical recommendations are 20+ million reads per sample (more for e.g. transcript-level inferences)

    • For statistical power, more replicates are better than a higher sequencing depth

Number of replicates vs. sequencing depth

For statistical power, more replicates are better than a higher sequencing depth:

From Liu et al. 2014

From reads to counts

Overview of steps

Modified after Kukurba & Montgomery 2015

From reads to counts: overview

You will typically receive a “demultiplexed” (split by sample) set of FASTQ files.

Once you receive your data, the first series of analysis steps involves going from the raw reads to a count table (the count table will have a read count for each gene in each sample).


This part is bioinformatics-heavy with large files, a need for lots of computing power such as with a supercomputer, command-line (Unix shell) programs — it specifically involves:

  1. Read preprocessing: QC, trimming, and optionally rRNA removal

  2. Aligning reads to a reference genome (+ alignment QC)

  3. Quantifying expression levels (per-gene & per-sample to create a count table)


This can be run using standardized, one-size-fits-all workflows, and is therefore (relatively) suitable to be outsourced to a company, facility, or collaborator.

Reads to counts: Read pre-processing

Read pre-processing includes the following steps:


  • Checking the quantity and quality of your reads

    • Does not change your data, but helps decide next pre-processing steps / sample exclusion

    • Also useful to check for contamination, library complexity, and adapter content

    • Typically done with the tool FastQC (and MultiQC for across-sample summaries)


  • Removing unwanted sequences
    • Adapters, low-quality ends, and very short reads — with a tool like TrimGalore or Trimmomatic.
    • rRNA-derived reads — with SortMeRNA (optional).
    • Contaminant sequences — with a taxonomic read classification program like Kraken2 (optional).

Recap: isoforms & alternative splicing

Transcript-level vs. gene-level

The terminology “transcript-level” vs. “gene-level”, e.g. in “transcript-level counts” refers to the distinction between having separate counts for each isoform (AKA transcript) versus a single count for each gene. Gene-level counts may either have been aggregated across isoforms, or reads were never assigned to isoforms in the first place.

Reads to counts: Alignment to a reference genome

The alignment of reads to a reference genome needs to be “splice-aware”.
Alternatively, you can align to the transcriptome (i.e., all mature transcripts).

Berge et al. 2019 (www.annualreviews.org/doi/10.1146/annurev-biodatasci-072018-021255)

Reads to counts: Alignment to a reference genome

The alignment of reads to a reference genome needs to be “splice-aware”.
Alternatively, you can align to the transcriptome (i.e., all mature transcripts).


Alignment usually consists of two steps:

  • One-time creation of a tool-specific “index” of the reference genome/transcriptome.

  • For each sample independently, alignment of the reads to the reference index.


Alignment tools

Some of the currently most commonly used alignment tools are HISAT2 & STAR (map to genome), and RSEM & Salmon & Kallisto (map to transcriptome).

What if I don’t have a reference genome?

Without a reference genome, an RNA-seq project is considerably more work.
These two extra steps are needed:


  • De novo transcriptome assembly
    First, your reads can be used to first assemble a transcriptome from scratch with a tool like Trinity. (This is quite compute-intensive!)

  • Transcriptome annotation
    Then, your transcriptome will need to be functionally annotated:

    • This means you try to assign gene names and functional categories to your transcripts

    • A very similar process to genome annotation, and works in part by using BLAST or related tools to find orthologous, already annotated, genes in other organisms.


Then, you can align to your newly assembled transcriptome the same way you would align to a reference transcriptome (obviously, you won’t have the option to align to the genome).

Reads to counts: alignment QC

These kind of stats can only be assessed after aligning to the genome (vs. to the transcriptome)1.

  • Alignment rates
    What percentage of reads was successfully aligned?

    • Low rates (<80%) => cross-species contamination or poor reference genome quality

    • Multi-mapped reads (mapping to multiple locations in the genome) are typically accepted by the mapper but are not or probabilistically quantified — more on this later.


  • Alignment targets
    What percentages of aligned reads mapped to exons vs. introns vs. intergenic regions?

    • High intronic mapping rates => presence of pre-mRNA

    • High intergenic mapping rates => DNA contamination or poor genome assembly/annotation quality

Reads to counts: alignment QC

  • Across-gene coverage distribution
    An uneven distribution may indicate high levels of RNA degradation

  • Outlier samples
    Are patterns consistent across samples or are there outliers that may need to be removed?


Alignment QC tools

You can get these stats by checking the aligner’s “log” outputs, and by running tools like QualiMap and RSeQC.

Reads to counts: quantification

At heart, a simple counting exercise once you have the alignments in hand.
But made more complicated by sequencing biases and multi-mapping reads.


Several metrics of gene expression levels exist:

  • Adjusted by gene length and library size
    E.g., FPKM (superseded) and TPM (Transcripts Per Million)

  • Raw counts
    These are used by most downstream approaches (more later).


Current best-performing tools (Salmon, Kallisto) do transcript-level quantification — even though this is typically followed by gene-level aggregation prior to downstream analysis.


Fast-moving field

Several very commonly used tools like FeatureCounts (>15k citations) and HTSeq (<18k citation) have become disfavored in the past couple of years, as they e.g. don’t count multi-mapping reads at all.

A best-practice workflow to produce counts

The “nf-core” initiative (https://nf-co.re) attempts to produce best-practice workflows/pipelines using the Nextflow workflow language.

This entire workflow can be run with a single command!

https://nf-co.re/rnaseq

From counts to conclusions

Count table analysis: overview

The second part of RNA-seq data analysis involves analyzing the count table.
In contrast to the first part, this can be done on a laptop and instead is heavier on statistics, data visualization and biological interpretation.

DIY!

This part also more custom and iterative and I do not recommend to have this done by a third party.


Typically done using the specialized RNA-seq “packages” in the R language. Common steps involve:

  • PCA
    Assessing overall sample clustering patterns with a Principal Component Analysis (PCA)

  • Differential expression analysis
    Finding genes that differ in expression level between sample groups (DEGs)

  • Functional enrichment analysis
    See whether certain gene functions are overrepresented among DEGs

PCA &

Two or more groups are typically compared, such as plants exposed to virulent vs. avirulent aphids.

  • A PCA can tell you whether there’s an overall difference in the plant’s transcriptional response between treatments

Fig. 1 from Garrigos et al. 2023

Differential expression (DE) analysis

  • A Differential Expression (DE) analysis allows you to quantify the magnitude of this difference, and especially to pinpoint specific genes with statistically significant differences in expression levels (Differentially Expressed Genes, DEGs).

[TODO: Volcano plot and/or gene plot]

DE analysis: >1 pairwise comparison

More than two groups

Plants exposed to:

  • A virulent pathogen
  • An avirulent pathogen
  • A mock inoculation

Pairwise comparisons:

  • Virulent vs. avirulent
  • Mock vs. virulent
  • Mock vs. avirulent


More than one independent variable

  • Plant type: resistant & susceptible
  • Pathogen type: virulent & avirulent
  • Resistant plant vs. virulent pathogen
  • Resistant plant vs. avirulent pathogen
  • Susceptible plant vs. virulent pathogen
  • Susceptible plant vs. avirulent pathogen

DE analysis: other designs

Statistical designs other than pairwise comparisons are also possible. For example:

  • Time series (or other continuous variables; regression analysis)
    Which genes have monotonous in- or decreases across three or more time points?

  • Controlling for an independent variable (fixed or random effect)
    Assessing the effect of one variable while controlling for another. E.g.:

    • Which genes respond similarly to a virulent (vs. avirulent) pathogen in resistant and susceptible plants?

    • Which genes respond similarly to a virulent (vs. avirulent) pathogen regardless of field plot?


  • Interaction effects between two independent variables
    E.g.: which genes respond differently to a virulent (vs. avirulent) pathogen in resistant and susceptible plants?

DE analysis: general statistical considerations

  • Gene count normalization
    To be able to fairly compare samples, raw counts need to be adjusted:

    • By library size, which is the total number of gene counts per sample
    • By library composition, e.g. to correct for sample-specific extremely abundant genes that “steal” most of that sample’s counts

What about gene lengths?

Even though longer genes are more likely to be samples and gene counts are therefore confounded by gene length, there is no normalization by gene length, because genes are not compared.

  • Probability distribution of the count data

    • Gene counts have higher variance than a Poisson distribution: negative binomial distribution is typically used.
    • Variance (“dispersion”) estimates are gene-specific but “borrow” information from other genes (details beyond the scope of this lecture).

DE analysis: general statistical considerations (cont.)

  • Multiple-testing correction

    • 10,000+ genes are independently tested during a DE analysis, so there is a dire need for multiple testing correction.

    • The standard method is not the very strict Bonferroni correction but the Benjamini-Hochberg (BH) method.


  • Log2-fold changes (logFC/LFC) as a measure of expression difference

    • TODO

Specialized R/Bioconductor packages like DESeq2 and EdgeR make differential expression analysis relatively straightforward and automatically take care of the abovementioned considerations (we will use DESeq2 in the lab).

Functional enrichment: introduction

Lists of DEGs can be quite long, and it is not always easy to make biological sense of them. Functional enrichment analyses help with this.


They check whether certain functional categories of genes (e.g., biological processes, or pathways) are statistically overrepresented among up- and/or downregulated genes.

Functional enrichment: database overview

There are a number of databases that group genes into functional categories, but the two main ones used for enrichment analysis are:

  • Gene Ontology (GO)

  • Kyoto Encyclopedia of Genes and Genomes (KEGG)


Genome annotations for a specific organism often but not always include GO or KEGG annotations.

KEGG annotations are more commonly missing but can also be more easily generated, by uploading a FASTA file with amino acid sequences to KEGG’s GhostKOALA webservice: https://www.kegg.jp/ghostkoala/.

Functional enrichment: GO

  • Genes are assigned zero, one or more GO “terms

  • Hierarchical structure with more specific terms grouping into more general terms

  • Consists of three “ontologies”:

    • Biological Process

    • Molecular Function

    • Cellular Component

Functional enrichment: KEGG

  • Orthologs, modules and pathways

  • Contains fewer genes than GO

[Include pathway figure]

Functional enrichment: statistics

The two most common statistical approaches are:

  • Overrepresentation Analysis (ORA)
    With a list of genes divided in two groups (DE vs. not-DE1), test whether the groups contain different frequencies of each functional gene category, like a Chi-Square/Fisher’s Exact Test:
DE not-DE
In photosynthesis category 15 30
Not in photosynthesis category 85 14,870

  • Gene Set Enrichment Analysis (GSEA)
    Uses the expression level change (log-fold changes, LFC) of each gene and does not take statistical significance into account. It basically asks whether the LFCs of genes in a category are significantly shifted away from zero, either upregulated or downregulated.

Questions?